import pandas as pd
import numpy as np
import seaborn as sns
8 Data wrangling
Data wrangling refers to combining, transforming, and re-arranging data to make it suitable for further analysis. We’ll use Pandas for all data wrangling operations.
8.1 Hierarchical indexing
Until now we have seen only a single level of indexing in the rows and columns of a Pandas DataFrame. Hierarchical indexing refers to having multiple index levels on an axis (row / column) of a Pandas DataFrame. It helps us work with higher-dimensional data in a lower-dimensional form.
8.1.1 Hierarchical indexing in Pandas Series
Let us define a Pandas Series as we did in Chapter 5:
#Defining a Pandas Series
series_example = pd.Series(['these','are','english','words','estas','son','palabras','en','español',
                            'ce','sont','des','françai','mots'])
series_example
0 these
1 are
2 english
3 words
4 estas
5 son
6 palabras
7 en
8 español
9 ce
10 sont
11 des
12 françai
13 mots
dtype: object
Let us use the attribute nlevels
to find the number of levels of the row indices of this Series:
series_example.index.nlevels
1
The Series series_example
has only one level of row indices.
Let us introduce another level of row indices while defining the Series:
#Defining a Pandas Series with multiple levels of row indices
series_example = pd.Series(['these','are','english','words','estas','son','palabras','en','español',
                            'ce','sont','des','françai','mots'],
                           index = [['English']*4+['Spanish']*5+['French']*5, list(range(1,5))+list(range(1,6))*2])
series_example
English 1 these
2 are
3 english
4 words
Spanish 1 estas
2 son
3 palabras
4 en
5 español
French 1 ce
2 sont
3 des
4 françai
5 mots
dtype: object
In the above Series, there are two levels of row indices:
series_example.index.nlevels
2
8.1.2 Hierarchical indexing in Pandas DataFrame
In a Pandas DataFrame, both the rows and the columns can have hierarchical indexing. For example, consider the DataFrame below:
data = np.array([[771517,2697000,815201,3849000],[4.2,5.6,2.8,4.6],
                 [7.8,234.5,46.9,502],[6749, 597, 52, 305]])
df_example = pd.DataFrame(data, index = [['Demographics']*2+['Geography']*2,
                                         ['Population','Unemployment (%)','Area (mile-sq)','Elevation (feet)']],
                          columns = [['Illinois']*2+['California']*2,['Evanston','Chicago','San Francisco','Los Angeles']])
df_example
Illinois | California | ||||
---|---|---|---|---|---|
Evanston | Chicago | San Francisco | Los Angeles | ||
Demographics | Population | 771517.0 | 2697000.0 | 815201.0 | 3849000.0 |
Unemployment (%) | 4.2 | 5.6 | 2.8 | 4.6 | |
Geography | Area (mile-sq) | 7.8 | 234.5 | 46.9 | 502.0 |
Elevation (feet) | 6749.0 | 597.0 | 52.0 | 305.0 |
In the above DataFrame, both the rows and columns have 2 levels of indexing. The number of levels of column indices can be found using the attribute nlevels
:
df_example.columns.nlevels
2
The columns
attribute will now have a MultiIndex datatype, in contrast to the Index datatype with a single level of indexing. The same holds for the row indices.
type(df_example.columns)
pandas.core.indexes.multi.MultiIndex
df_example.columns
MultiIndex([( 'Illinois', 'Evanston'),
( 'Illinois', 'Chicago'),
('California', 'San Francisco'),
('California', 'Los Angeles')],
)
The hierarchical levels can have names. Let us assign names to each level of the row and column labels:
#Naming the row index levels
df_example.index.names = ['Information type', 'Statistic']

#Naming the column index levels
df_example.columns.names = ['State', 'City']

#Viewing the DataFrame
df_example
State | Illinois | California | |||
---|---|---|---|---|---|
City | Evanston | Chicago | San Francisco | Los Angeles | |
Information type | Statistic | ||||
Demographics | Population | 771517.0 | 2697000.0 | 815201.0 | 3849000.0 |
Unemployment (%) | 4.2 | 5.6 | 2.8 | 4.6 | |
Geography | Area (mile-sq) | 7.8 | 234.5 | 46.9 | 502.0 |
Elevation (feet) | 6749.0 | 597.0 | 52.0 | 305.0 |
Observe that the names of the row and column labels appear when we view the DataFrame.
8.1.2.1 get_level_values()
The labels at each level of the column index can be obtained using the method get_level_values()
. The outermost level corresponds to level = 0, and the level number increases as we go to the inner levels.
#Column labels at level 0 (the outer level)
df_example.columns.get_level_values(0)
Index(['Illinois', 'Illinois', 'California', 'California'], dtype='object', name='State')
#Column labels at level 1 (the inner level)
df_example.columns.get_level_values(1)
Index(['Evanston', 'Chicago', 'San Francisco', 'Los Angeles'], dtype='object', name='City')
8.1.3 Subsetting data
We can use the indices at the outer levels to concisely subset a Series / DataFrame.
The first four observations of the Series series_example
correspond to the outer row index English
, the next five correspond to Spanish
, and the last five to French
. Let us subset all the observations corresponding to the outer row index English
:
#Subsetting data by row index
series_example['English']
1 these
2 are
3 english
4 words
dtype: object
Just like in the case of single-level indices, if we wish to subset based on multiple outer-level indices, we put the indices within an additional pair of square brackets []
. For example, let us subset all the observations corresponding to the row indices English
and French
:
#Subsetting data by multiple row indices
series_example[['English','French']]
English 1 these
2 are
3 english
4 words
French 1 ce
2 sont
3 des
4 françai
5 mots
dtype: object
We can also subset data using the inner row index. However, we will need to put a :
in the outer position to indicate that the row label at the inner level is being used.
#Subsetting data by the inner row index
series_example[:, 2]
English are
Spanish son
French sont
dtype: object
#Subsetting data by multiple inner row indices
series_example.loc[:, [1, 2]]
English 1 these
Spanish 1 estas
French 1 ce
English 2 are
Spanish 2 son
French 2 sont
dtype: object
As in Series, we can concisely subset rows / columns in a DataFrame based on the index at the outer levels.
df_example['Illinois']
City | Evanston | Chicago | |
---|---|---|---|
Information type | Statistic | ||
Demographics | Population | 771517.0 | 2697000.0 |
Unemployment (%) | 4.2 | 5.6 | |
Geography | Area (mile-sq) | 7.8 | 234.5 |
Elevation (feet) | 6749.0 | 597.0 |
Note that the datatype of each column name is a tuple. For example, let us find the datatype of the \(1^{st}\) column name:
#First column name
df_example.columns[0]
('Illinois', 'Evanston')
#Datatype of first column name
type(df_example.columns[0])
tuple
Thus, columns at the inner levels can be accessed by specifying the name as a tuple. For example, let us subset the column Evanston
:
#Subsetting the column 'Evanston'
df_example[('Illinois','Evanston')]
Information type Statistic
Demographics Population 771517.0
Unemployment (%) 4.2
Geography Area (mile-sq) 7.8
Elevation (feet) 6749.0
Name: (Illinois, Evanston), dtype: float64
#Subsetting the columns 'Evanston' and 'Chicago' of the outer column level 'Illinois'
df_example.loc[:, ('Illinois', ['Evanston','Chicago'])]
State | Illinois | ||
---|---|---|---|
City | Evanston | Chicago | |
Information type | Statistic | ||
Demographics | Population | 771517.0 | 2697000.0 |
Unemployment (%) | 4.2 | 5.6 | |
Geography | Area (mile-sq) | 7.8 | 234.5 |
Elevation (feet) | 6749.0 | 597.0 |
8.1.4 Practice exercise 1
Read the table consisting of GDP per capita of countries from the webpage: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita .
To only read the relevant table, read the tables that contain the word ‘Country’.
8.1.4.1
How many levels of indexing are there in the rows and columns?
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita', match = 'Country')
gdp_per_capita = dfs[0]
gdp_per_capita.head()
Country/Territory | UN Region | IMF[4] | World Bank[5] | United Nations[6] | ||||
---|---|---|---|---|---|---|---|---|
Country/Territory | UN Region | Estimate | Year | Estimate | Year | Estimate | Year | |
0 | Liechtenstein * | Europe | — | — | 169049 | 2019 | 180227 | 2020 |
1 | Monaco * | Europe | — | — | 173688 | 2020 | 173696 | 2020 |
2 | Luxembourg * | Europe | 127673 | 2022 | 135683 | 2021 | 117182 | 2020 |
3 | Bermuda * | Americas | — | — | 110870 | 2021 | 123945 | 2020 |
4 | Ireland * | Europe | 102217 | 2022 | 99152 | 2021 | 86251 | 2020 |
Just by looking at the DataFrame, it seems as if there are two levels of indexing for columns and one level of indexing for rows. However, let us confirm it with the nlevels
attribute.
gdp_per_capita.columns.nlevels
2
Yes, there are 2 levels of indexing for columns.
gdp_per_capita.index.nlevels
1
There is one level of indexing for rows.
8.1.4.2
Subset a DataFrame that selects the country, and the United Nations’ estimates of GDP per capita with the corresponding year.
gdp_per_capita.loc[:, ['Country/Territory','United Nations[6]']]
Country/Territory | United Nations[6] | ||
---|---|---|---|
Country/Territory | Estimate | Year | |
0 | Liechtenstein * | 180227 | 2020 |
1 | Monaco * | 173696 | 2020 |
2 | Luxembourg * | 117182 | 2020 |
3 | Bermuda * | 123945 | 2020 |
4 | Ireland * | 86251 | 2020 |
... | ... | ... | ... |
217 | Madagascar * | 470 | 2020 |
218 | Central African Republic * | 481 | 2020 |
219 | Sierra Leone * | 475 | 2020 |
220 | South Sudan * | 1421 | 2020 |
221 | Burundi * | 286 | 2020 |
222 rows × 3 columns
8.1.4.3
Subset a DataFrame that selects only the World Bank and United Nations’ estimates of GDP per capita without the corresponding year or country.
gdp_per_capita.loc[:, (['World Bank[5]','United Nations[6]'], 'Estimate')]
World Bank[5] | United Nations[6] | |
---|---|---|
Estimate | Estimate | |
0 | 169049 | 180227 |
1 | 173688 | 173696 |
2 | 135683 | 117182 |
3 | 110870 | 123945 |
4 | 99152 | 86251 |
... | ... | ... |
217 | 515 | 470 |
218 | 512 | 481 |
219 | 516 | 475 |
220 | 1120 | 1421 |
221 | 237 | 286 |
222 rows × 2 columns
8.1.4.4
Subset a DataFrame that selects the country, and only the World Bank and United Nations’ estimates of GDP per capita without the corresponding years.
gdp_per_capita.loc[:, [('Country/Territory','Country/Territory'), ('United Nations[6]','Estimate'), ('World Bank[5]','Estimate')]]
Country/Territory | United Nations[6] | World Bank[5] | |
---|---|---|---|
Country/Territory | Estimate | Estimate | |
0 | Liechtenstein * | 180227 | 169049 |
1 | Monaco * | 173696 | 173688 |
2 | Luxembourg * | 117182 | 135683 |
3 | Bermuda * | 123945 | 110870 |
4 | Ireland * | 86251 | 99152 |
... | ... | ... | ... |
217 | Madagascar * | 470 | 515 |
218 | Central African Republic * | 481 | 512 |
219 | Sierra Leone * | 475 | 516 |
220 | South Sudan * | 1421 | 1120 |
221 | Burundi * | 286 | 237 |
222 rows × 3 columns
8.1.4.5
Drop all columns consisting of years. Use the level
argument of the drop()
method.
gdp_per_capita = gdp_per_capita.drop(columns = 'Year', level = 1)
gdp_per_capita
Country/Territory | UN Region | IMF[4] | World Bank[5] | United Nations[6] | |
---|---|---|---|---|---|
Country/Territory | UN Region | Estimate | Estimate | Estimate | |
0 | Liechtenstein * | Europe | — | 169049 | 180227 |
1 | Monaco * | Europe | — | 173688 | 173696 |
2 | Luxembourg * | Europe | 127673 | 135683 | 117182 |
3 | Bermuda * | Americas | — | 110870 | 123945 |
4 | Ireland * | Europe | 102217 | 99152 | 86251 |
... | ... | ... | ... | ... | ... |
217 | Madagascar * | Africa | 522 | 515 | 470 |
218 | Central African Republic * | Africa | 496 | 512 | 481 |
219 | Sierra Leone * | Africa | 494 | 516 | 475 |
220 | South Sudan * | Africa | 328 | 1120 | 1421 |
221 | Burundi * | Africa | 293 | 237 | 286 |
222 rows × 5 columns
8.1.4.6
In the dataset obtained above, drop the inner level of the column labels. Use the droplevel()
method.
gdp_per_capita = gdp_per_capita.droplevel(1, axis = 1)
gdp_per_capita
Country/Territory | UN Region | IMF[4] | World Bank[5] | United Nations[6] | |
---|---|---|---|---|---|
0 | Liechtenstein * | Europe | — | 169049 | 180227 |
1 | Monaco * | Europe | — | 173688 | 173696 |
2 | Luxembourg * | Europe | 127673 | 135683 | 117182 |
3 | Bermuda * | Americas | — | 110870 | 123945 |
4 | Ireland * | Europe | 102217 | 99152 | 86251 |
... | ... | ... | ... | ... | ... |
217 | Madagascar * | Africa | 522 | 515 | 470 |
218 | Central African Republic * | Africa | 496 | 512 | 481 |
219 | Sierra Leone * | Africa | 494 | 516 | 475 |
220 | South Sudan * | Africa | 328 | 1120 | 1421 |
221 | Burundi * | Africa | 293 | 237 | 286 |
222 rows × 5 columns
8.1.5 Practice exercise 2
Recall problem 2(e) from assignment 3 on Pandas, where we needed to find the African country that is the closest to country \(G\) (Luxembourg) with regard to social indicators.
We will first solve the question in the regular way using a single level of indexing (as you probably did during the assignment), and then see if it is easier with hierarchical indexing.
Execute the code below that we used to pre-process data to make it suitable for answering this question.
#Pre-processing data - execute this code
social_indicator = pd.read_csv("./Datasets/social_indicator.txt", sep = "\t", index_col = 0)
social_indicator.geographic_location = social_indicator.geographic_location.apply(lambda x: 'Asia' if 'Asia' in x else 'Europe' if 'Europe' in x else 'Africa' if 'Africa' in x else x)
social_indicator.rename(columns = {'geographic_location':'continent'}, inplace = True)
social_indicator = social_indicator.sort_index(axis = 1)
social_indicator.drop(columns = ['region','contraception'], inplace = True)
Below is the code to find the African country that is the closest to country \(G\) (Luxembourg) using single level of indexing. Your code in the assignment is probably similar to the one below:
#Finding the index of the country G (Luxembourg) that has the maximum GDP per capita
country_max_gdp_position = social_indicator.gdpPerCapita.argmax()

#Scaling the social indicator dataset
social_indicator_scaled = social_indicator.iloc[:,2:].apply(lambda x: (x-x.mean())/(x.std()))

#Computing the Manhattan distances of all countries from country G (Luxembourg)
manhattan_distances = (social_indicator_scaled-social_indicator_scaled.iloc[country_max_gdp_position,:]).abs().sum(axis = 1)

#Finding the indices of African countries
african_countries_indices = social_indicator.loc[social_indicator.continent=='Africa',:].index

#Filtering the Manhattan distances of African countries from country G (Luxembourg)
manhattan_distances_African = manhattan_distances[african_countries_indices]

#Finding the country among African countries that has the least Manhattan distance to country G (Luxembourg)
social_indicator.loc[manhattan_distances_African.idxmin(), 'country']
'Reunion'
8.1.5.1
Use the method set_index()
to set continent
and country
as hierarchical indices of rows. Find the African country that is the closest to country \(G\) (Luxembourg) using this hierarchically indexed data. How many lines will be eliminated from the code above? Which lines will be eliminated?
Hint: Since continent
and country
are row indices, you don’t need to explicitly find:
The row index of country \(G\) (Luxembourg),
The row indices of African countries.
The Manhattan distances for African countries.
social_indicator.set_index(['continent','country'], inplace = True)
social_indicator_scaled = social_indicator.apply(lambda x: (x-x.mean())/(x.std()))
manhattan_distances = (social_indicator_scaled-social_indicator_scaled.loc[('Europe','Luxembourg'),:]).abs().sum(axis = 1)
manhattan_distances['Africa'].idxmin()
'Reunion'
As we have converted the columns continent
and country
to row indices, all the lines of code where we were keeping track of the index of country \(G\), African countries, and Manhattan distances of African countries are eliminated. Three lines of code are eliminated.
Hierarchical indexing relieves us from keeping track of indices, if we set indices that are relatable to our analysis.
8.1.5.2
Use the Pandas DataFrame method mean()
with the level argument to find the mean value of all social indicators for each continent.
social_indicator.mean(level = 0)
economicActivityFemale | economicActivityMale | gdpPerCapita | illiteracyFemale | illiteracyMale | infantMortality | lifeFemale | lifeMale | totalfertilityrate | |
---|---|---|---|---|---|---|---|---|---|
continent | |||||||||
Asia | 41.592683 | 79.282927 | 27796.390244 | 23.635951 | 13.780878 | 39.853659 | 70.724390 | 66.575610 | 3.482927 |
Africa | 46.732258 | 79.445161 | 7127.483871 | 52.907226 | 33.673548 | 77.967742 | 56.841935 | 53.367742 | 4.889677 |
Oceania | 51.280000 | 77.953333 | 14525.666667 | 9.666667 | 6.585200 | 23.666667 | 72.406667 | 67.813333 | 3.509333 |
North America | 45.238095 | 77.166667 | 18609.047619 | 17.390286 | 14.609905 | 22.904762 | 75.457143 | 70.161905 | 2.804286 |
South America | 42.008333 | 75.575000 | 15925.916667 | 9.991667 | 6.750000 | 34.750000 | 72.691667 | 66.975000 | 2.872500 |
Europe | 52.060000 | 70.291429 | 45438.200000 | 2.308343 | 1.413543 | 10.571429 | 77.757143 | 70.374286 | 1.581714 |
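Note: in recent versions of Pandas (2.0 onwards), the level argument of mean() is no longer available. A minimal sketch of the equivalent computation with groupby(), assuming a recent Pandas version:
#Equivalent of social_indicator.mean(level=0): group by the outer row-index level ('continent') and average each column
social_indicator.groupby(level = 0).mean()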
8.1.6 Practice exercise 3
Let us try to find the areas where NU students lack in diversity. Read survey_data_clean.csv. Use hierarchical indexing to classify the columns as follows:
Classify the following variables as lifestyle:
lifestyle = ['fav_alcohol', 'parties_per_month', 'smoke', 'weed', 'streaming_platforms', 'minutes_ex_per_week',
             'sleep_hours_per_day', 'internet_hours_per_day', 'procrastinator', 'num_clubs', 'student_athlete', 'social_media']
Classify the following variables as personality:
personality = ['introvert_extrovert', 'left_right_brained', 'personality_type',
               'num_insta_followers', 'fav_sport', 'learning_style', 'dominant_hand']
Classify the following variables as opinion:
opinion = ['love_first_sight', 'expected_marriage_age', 'expected_starting_salary', 'how_happy',
           'fav_number', 'fav_letter', 'fav_season', 'political_affliation', 'cant_change_math_ability',
           'can_change_math_ability', 'math_is_genetic', 'much_effort_is_lack_of_talent']
Classify the following variables as academic information:
academic_info = ['major', 'num_majors_minors',
                 'high_school_GPA', 'NU_GPA', 'school_year', 'AP_stats', 'used_python_before']
Classify the following variables as demographics:
demographics = ['only_child', 'birth_month',
                'living_location_on_campus', 'age', 'height', 'height_father',
                'height_mother', 'childhood_in_US', 'gender', 'region_of_residence']
Write a function that finds the number of variables having outliers in a dataset. Apply the function to each of the 5 categories of variables in the dataset. Our hypothesis is that the category that has the maximum number of variables with outliers has the least amount of diversity. For continuous variables, use Tukey’s fences criterion to identify outliers. For categorical variables, consider levels having less than 1% observations as outliers. Assume that numeric variables that have more than 2 distinct values are continuous.
Solution:
#Using hierarchical indexing to classify columns

#Reading data
survey_data = pd.read_csv('./Datasets/survey_data_clean.csv')

#Arranging columns in the order of categories
survey_data_HI = survey_data[lifestyle+personality+opinion+academic_info+demographics]

#Creating hierarchical indexing to classify columns
survey_data_HI.columns = [['lifestyle']*len(lifestyle)+['personality']*len(personality)+['opinion']*len(opinion)+\
                          ['academic_info']*len(academic_info)+['demographics']*len(demographics),
                          lifestyle+personality+opinion+academic_info+demographics]
#Function to identify outliers based on Tukey's fences for continuous variables and the 1% criterion for categorical variables
def rem_outliers(x):
    if ((len(x.value_counts())>2) & (x.dtype!='O')): #continuous variable
        q1 = x.quantile(0.25)
        q3 = x.quantile(0.75)
        intQ_range = q3-q1

        #Tukey's fences
        Lower_fence = q1 - 1.5*intQ_range
        Upper_fence = q3 + 1.5*intQ_range

        num_outliers = ((x<Lower_fence) | (x>Upper_fence)).sum()
        if num_outliers>0:
            return True
        return False
    else: #categorical variable
        if np.min(x.value_counts()/len(x))<0.01:
            return True
        return False
#Number of variables containing outlier(s) in each category
for category in survey_data_HI.columns.get_level_values(0).unique():
    print("Number of variables with outliers for category ", category, " = ", survey_data_HI[category].apply(rem_outliers).sum())
Number of variables with outliers for category lifestyle = 7
Number of variables with outliers for category personality = 2
Number of variables with outliers for category opinion = 4
Number of variables with outliers for category academic_info = 3
Number of variables with outliers for category demographics = 4
The lifestyle category has the highest number of variables containing outlier(s). If the hypothesis is true, then NU students have the least diversity in their lifestyle, among all the categories.
Although one may say that the lifestyle category has the highest number of columns (as shown below), the proportion of columns having outlier(s) is also the highest for this category.
for category in survey_data_HI.columns.get_level_values(0).unique():
    print("Number of columns in category ", category, " = ", survey_data_HI[category].shape[1])
Number of columns in category lifestyle = 12
Number of columns in category personality = 7
Number of columns in category opinion = 12
Number of columns in category academic_info = 7
Number of columns in category demographics = 10
8.1.7 Reshaping data
Apart from ease in subsetting data, hierarchical indexing also plays a role in reshaping data.
8.1.7.1 unstack()
(Pandas Series method)
The Pandas Series method unstack()
pivots the desired level of row indices to columns, thereby creating a DataFrame. By default, the inner-most level of the row labels is pivoted.
#Pivoting the inner-most Series row index to column labels
series_example_unstack = series_example.unstack()
series_example_unstack
1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|
English | these | are | english | words | NaN |
French | ce | sont | des | françai | mots |
Spanish | estas | son | palabras | en | español |
We can pivot the outer level of the row labels by specifying it in the level
argument:
#Pivoting the outer row indices to column labels
series_example_unstack = series_example.unstack(level = 0)
series_example_unstack
English | French | Spanish | |
---|---|---|---|
1 | these | ce | estas |
2 | are | sont | son |
3 | english | des | palabras |
4 | words | françai | en |
5 | NaN | mots | español |
8.1.7.2 unstack()
(Pandas DataFrame method)
The Pandas DataFrame method unstack()
pivots the specified level of row indices to the new inner-most level of column labels. By default, the inner-most level of the row labels is pivoted.
#Pivoting the inner level of row labels to the inner-most level of column labels
df_example.unstack()
State | Illinois | California | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
City | Evanston | Chicago | San Francisco | Los Angeles | ||||||||||||
Statistic | Area (mile-sq) | Elevation (feet) | Population | Unemployment (%) | Area (mile-sq) | Elevation (feet) | Population | Unemployment (%) | Area (mile-sq) | Elevation (feet) | Population | Unemployment (%) | Area (mile-sq) | Elevation (feet) | Population | Unemployment (%)
Information type | ||||||||||||||||
Demographics | NaN | NaN | 771517.0 | 4.2 | NaN | NaN | 2697000.0 | 5.6 | NaN | NaN | 815201.0 | 2.8 | NaN | NaN | 3849000.0 | 4.6 |
Geography | 7.8 | 6749.0 | NaN | NaN | 234.5 | 597.0 | NaN | NaN | 46.9 | 52.0 | NaN | NaN | 502.0 | 305.0 | NaN | NaN |
As with Series, we can pivot the outer level of the row labels by specifying it in the level
argument:
#Pivoting the outer level (level = 0) of row labels to the inner-most level of column labels
df_example.unstack(level = 0)
State | Illinois | California | ||||||
---|---|---|---|---|---|---|---|---|
City | Evanston | Chicago | San Francisco | Los Angeles | ||||
Information type | Demographics | Geography | Demographics | Geography | Demographics | Geography | Demographics | Geography |
Statistic | ||||||||
Area (mile-sq) | NaN | 7.8 | NaN | 234.5 | NaN | 46.9 | NaN | 502.0 |
Elevation (feet) | NaN | 6749.0 | NaN | 597.0 | NaN | 52.0 | NaN | 305.0 |
Population | 771517.0 | NaN | 2697000.0 | NaN | 815201.0 | NaN | 3849000.0 | NaN |
Unemployment (%) | 4.2 | NaN | 5.6 | NaN | 2.8 | NaN | 4.6 | NaN
8.1.7.3 stack()
The inverse of unstack()
is the stack()
method, which creates the inner-most level of row indices by pivoting the column labels of the prescribed level.
Note that if the column labels have only one level, we don’t need to specify a level.
#Stacking the columns of a DataFrame
series_example_unstack.stack()
English 1 these
2 are
3 english
4 words
French 1 ce
2 sont
3 des
4 françai
5 mots
Spanish 1 estas
2 son
3 palabras
4 en
5 español
dtype: object
However, if the columns have multiple levels, we can specify the level to stack as the inner-most row level. By default, the inner-most column level is stacked.
#Stacking the inner-most column labels to the inner-most row indices
df_example.stack()
State | California | Illinois | ||
---|---|---|---|---|
Information type | Statistic | City | ||
Demographics | Population | Chicago | NaN | 2697000.0 |
Evanston | NaN | 771517.0 | ||
Los Angeles | 3849000.0 | NaN | ||
San Francisco | 815201.0 | NaN | ||
Unemployment (%) | Chicago | NaN | 5.6 |
Evanston | NaN | 4.2 | ||
Los Angeles | 4.6 | NaN | ||
San Francisco | 2.8 | NaN | ||
Geography | Area (mile-sq) | Chicago | NaN | 234.5 |
Evanston | NaN | 7.8 | ||
Los Angeles | 502.0 | NaN | ||
San Francisco | 46.9 | NaN | ||
Elevation (feet) | Chicago | NaN | 597.0 | |
Evanston | NaN | 6749.0 | ||
Los Angeles | 305.0 | NaN | ||
San Francisco | 52.0 | NaN |
#Stacking the outer column labels to the inner-most row indices
df_example.stack(level = 0)
City | Chicago | Evanston | Los Angeles | San Francisco | ||
---|---|---|---|---|---|---|
Information type | Statistic | State | ||||
Demographics | Population | California | NaN | NaN | 3849000.0 | 815201.0 |
Illinois | 2697000.0 | 771517.0 | NaN | NaN | ||
Unemployment (%) | California | NaN | NaN | 4.6 | 2.8 |
Illinois | 5.6 | 4.2 | NaN | NaN | ||
Geography | Area (mile-sq) | California | NaN | NaN | 502.0 | 46.9 |
Illinois | 234.5 | 7.8 | NaN | NaN | ||
Elevation (feet) | California | NaN | NaN | 305.0 | 52.0 | |
Illinois | 597.0 | 6749.0 | NaN | NaN |
8.2 Merging data
The Pandas function merge() uses columns defined as key column(s) to merge two datasets. In case the key column(s) are not defined, the overlapping column(s) are considered as the key columns.
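For example, the key column(s) can be specified explicitly with the on argument, or with left_on and right_on when the key columns have different names in the two datasets. A minimal sketch; the toy DataFrames below are made up for illustration:
#Hypothetical toy datasets
students = pd.DataFrame({'Student': ['Kitana', 'Jax'], 'Major': ['Statistics', 'Music']})
majors = pd.DataFrame({'Degree': ['Statistics', 'Music'], 'Skills': ['Inference', 'Opera']})

#Explicitly specifying key columns that have different names in the two datasets
pd.merge(students, majors, left_on = 'Major', right_on = 'Degree')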
8.2.1 Join types
When a dataset is merged with another based on key column(s), one of the following four types of join will occur depending on the repetition of the values of the key(s) in the datasets.
(i) One-to-one, (ii) Many-to-one, (iii) One-to-many, and (iv) Many-to-many
The type of join may sometimes determine the number of rows in the merged dataset. If we don’t get the expected number of rows in the merged dataset, an investigation of the datasets may be necessary to identify and resolve the issue. There may be several possible issues; for example, a dataset may not be arranged in the way we assumed it to be.
We’ll use toy datasets to understand the above types of joins. The .csv files with the prefix student consist of the names of a few students along with their majors, and the files with the prefix skills consist of the names of majors along with the skills imparted by the respective majors.
data_student = pd.read_csv('./Datasets/student_one.csv')
data_skill = pd.read_csv('./Datasets/skills_one.csv')
8.2.1.1 One-to-one join
Each row in one dataset is linked (or related) to a single row in another dataset based on the key column(s).
data_student
Student | Major | |
---|---|---|
0 | Kitana | Statistics |
1 | Jax | Computer Science |
2 | Sonya | Material Science |
3 | Johnny | Music |
data_skill
Major | Skills | |
---|---|---|
0 | Statistics | Inference |
1 | Computer Science | Machine learning |
2 | Material Science | Structure prediction |
3 | Music | Opera |
pd.merge(data_student,data_skill)
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Jax | Computer Science | Machine learning |
2 | Sonya | Material Science | Structure prediction |
3 | Johnny | Music | Opera |
8.2.1.2 Many-to-one join
One or more rows in one dataset are linked (or related) to a single row in the other dataset based on the key column(s).
data_student = pd.read_csv('./Datasets/student_many.csv')
data_skill = pd.read_csv('./Datasets/skills_one.csv')
data_student
Student | Major | |
---|---|---|
0 | Kitana | Statistics |
1 | Kitana | Computer Science |
2 | Jax | Computer Science |
3 | Sonya | Material Science |
4 | Johnny | Music |
5 | Johnny | Statistics |
data_skill
Major | Skills | |
---|---|---|
0 | Statistics | Inference |
1 | Computer Science | Machine learning |
2 | Material Science | Structure prediction |
3 | Music | Opera |
pd.merge(data_student,data_skill)
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Johnny | Statistics | Inference |
2 | Kitana | Computer Science | Machine learning |
3 | Jax | Computer Science | Machine learning |
4 | Sonya | Material Science | Structure prediction |
5 | Johnny | Music | Opera |
8.2.1.3 One-to-many join
Each row in one dataset is linked (or related) to one or more rows in the other dataset based on the key column(s).
data_student = pd.read_csv('./Datasets/student_one.csv')
data_skill = pd.read_csv('./Datasets/skills_many.csv')
data_student
Student | Major | |
---|---|---|
0 | Kitana | Statistics |
1 | Jax | Computer Science |
2 | Sonya | Material Science |
3 | Johnny | Music |
data_skill
Major | Skills | |
---|---|---|
0 | Statistics | Inference |
1 | Statistics | Modeling |
2 | Computer Science | Machine learning |
3 | Computer Science | Computing |
4 | Material Science | Structure prediction |
5 | Music | Opera |
6 | Music | Pop |
7 | Music | Classical |
pd.merge(data_student,data_skill)
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Kitana | Statistics | Modeling |
2 | Jax | Computer Science | Machine learning |
3 | Jax | Computer Science | Computing |
4 | Sonya | Material Science | Structure prediction |
5 | Johnny | Music | Opera |
6 | Johnny | Music | Pop |
7 | Johnny | Music | Classical |
8.2.1.4 Many-to-many join
One or more rows in one dataset are linked (or related) to one or more rows in the other dataset based on the key column(s).
data_student = pd.read_csv('./Datasets/student_many.csv')
data_skill = pd.read_csv('./Datasets/skills_many.csv')
data_student
Student | Major | |
---|---|---|
0 | Kitana | Statistics |
1 | Kitana | Computer Science |
2 | Jax | Computer Science |
3 | Sonya | Material Science |
4 | Johnny | Music |
5 | Johnny | Statistics |
data_skill
Major | Skills | |
---|---|---|
0 | Statistics | Inference |
1 | Statistics | Modeling |
2 | Computer Science | Machine learning |
3 | Computer Science | Computing |
4 | Material Science | Structure prediction |
5 | Music | Opera |
6 | Music | Pop |
7 | Music | Classical |
pd.merge(data_student,data_skill)
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Kitana | Statistics | Modeling |
2 | Johnny | Statistics | Inference |
3 | Johnny | Statistics | Modeling |
4 | Kitana | Computer Science | Machine learning |
5 | Kitana | Computer Science | Computing |
6 | Jax | Computer Science | Machine learning |
7 | Jax | Computer Science | Computing |
8 | Sonya | Material Science | Structure prediction |
9 | Johnny | Music | Opera |
10 | Johnny | Music | Pop |
11 | Johnny | Music | Classical |
Note that there are two ‘Statistics’ rows in data_student
, and two ‘Statistics’ rows in data_skill
, resulting in 2x2 = 4 ‘Statistics’ rows in the merged data. The same is true for the ‘Computer Science’ Major
.
8.2.2 Join types with the how argument
The above-mentioned types of join (one-to-one, many-to-one, etc.) occur depending on the structure of the datasets being merged; we don’t have control over the type of join. However, we can control how the joins occur. We can merge (or join) two datasets in one of the following four ways:
(i) inner join, (ii) left join, (iii) right join, and (iv) outer join
8.2.2.1 inner join
This is the join that occurs by default, i.e., without specifying the how argument in the merge() function. In an inner join, only those observations are merged that have the same value(s) in the key column(s) of both the datasets.
data_student = pd.read_csv('./Datasets/student_how.csv')
data_skill = pd.read_csv('./Datasets/skills_how.csv')
data_student
Student | Major | |
---|---|---|
0 | Kitana | Statistics |
1 | Jax | Computer Science |
2 | Sonya | Material Science |
data_skill
Major | Skills | |
---|---|---|
0 | Statistics | Inference |
1 | Computer Science | Machine learning |
2 | Music | Opera |
pd.merge(data_student,data_skill)
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Jax | Computer Science | Machine learning |
When may you use an inner join? You should use an inner join when you cannot carry out the analysis unless the observation corresponding to the key column(s) is present in both the tables.
Example: Suppose you wish to analyze the association between vaccinations and the COVID infection rate based on country-level data. In one of the datasets, you have the infection rate for each country, while in the other you have the number of vaccinations in each country. The countries that have either the vaccination count or the infection rate missing cannot help analyze the association. In such a case you may be interested only in countries that have values for both variables. Thus, you will use an inner join to discard the countries with either value missing.
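A minimal sketch of this scenario; the DataFrames, country names, and values below are made up for illustration:
#Hypothetical country-level datasets
infection = pd.DataFrame({'country': ['A', 'B', 'C'], 'infection_rate': [0.10, 0.25, 0.18]})
vaccination = pd.DataFrame({'country': ['B', 'C', 'D'], 'vaccinations': [5.1e6, 2.3e6, 7.8e6]})

#Inner join keeps only the countries present in both datasets (here, B and C)
pd.merge(infection, vaccination, how = 'inner')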
8.2.2.2 left join
In left
join, the merged dataset will have all the rows of the dataset that is specified first in the merge()
function. Only those observations of the other dataset will be merged whose value(s) in the key column(s) exist in the dataset specified first in the merge()
function.
pd.merge(data_student, data_skill, how = 'left')
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Jax | Computer Science | Machine learning |
2 | Sonya | Material Science | NaN |
When may you use a left join? You should use a left join when the primary variable(s) of interest are present in one of the datasets and their missing values cannot be imputed. The variable(s) in the other dataset may not be as important, or it may be possible to reasonably impute their values when they are missing for an observation in the primary dataset.
Examples:
Suppose you wish to analyze the association between the covid infection rate and the government effectiveness score (a metric used to determine the effectiveness of the government in implementing policies, upholding law and order etc.) based on the data of all countries. Let us say that one of the datasets contains the covid infection rate, while the other one contains the government effectiveness score for each country. If the infection rate for a country is missing, it might be hard to impute. However, the government effectiveness score may be easier to impute based on GDP per capita, crime rate etc. - information that is easily available online. In such a case, you may wish to use a left join where you keep all the countries for which the infection rate is known.
Suppose you wish to analyze the association between demographics such as age, income, etc. and the amount of credit card spend. Let us say one of the datasets contains the demographic information of each customer, while the other one contains the credit card spend of the customers who made at least one purchase. In such a case, you may want to do a left join, as customers not making any purchase might be absent from the card spend data. Their spend can be imputed as zero after merging the datasets, as sketched below.
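A minimal sketch of this second example; the DataFrames, column names, and values below are made up for illustration:
#Hypothetical customer datasets
demographics_data = pd.DataFrame({'customer_id': [1, 2, 3], 'age': [34, 51, 27]})
card_spend = pd.DataFrame({'customer_id': [1, 3], 'spend': [420.0, 95.5]})

#Left join keeps all customers; customer 2 gets NaN spend
merged = pd.merge(demographics_data, card_spend, how = 'left')

#Customers absent from the spend data made no purchase, so impute their spend as zero
merged['spend'] = merged['spend'].fillna(0)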
8.2.2.3 right join
In right
join, the merged dataset will have all the rows of the dataset that is specified second in the merge()
function. Only those observations of the other dataset will be merged whose value(s) in the key column(s) exist in the dataset specified second in the merge()
function.
pd.merge(data_student, data_skill, how = 'right')
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Jax | Computer Science | Machine learning |
2 | NaN | Music | Opera |
When may you use a right join? A right join serves the same purpose as a left join with the order of the datasets swapped, so you can always use a left join instead of a right join.
8.2.2.4 outer join
In outer
join, the merged dataset will have all the rows of both the datasets being merged.
pd.merge(data_student, data_skill, how = 'outer')
Student | Major | Skills | |
---|---|---|---|
0 | Kitana | Statistics | Inference |
1 | Jax | Computer Science | Machine learning |
2 | Sonya | Material Science | NaN |
3 | NaN | Music | Opera |
When may you use an outer join? You should use an outer join when you cannot afford to lose data present in either of the tables. All the other joins may result in loss of data.
Example: Suppose I took two course surveys for this course. If I need to analyze student sentiment during the course, I will take an outer join of both the surveys. Assume that each survey is a dataset, where each row corresponds to a unique student. Even if a student has answered only one of the two surveys, it will be indicative of the sentiment, and will be useful to keep in the merged dataset.
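A minimal sketch of this scenario; the survey DataFrames and values below are made up for illustration. The indicator argument of merge() additionally records which survey(s) each student answered:
#Hypothetical survey datasets, one row per student
survey1 = pd.DataFrame({'student': ['Ana', 'Ben'], 'sentiment_survey1': [4, 3]})
survey2 = pd.DataFrame({'student': ['Ben', 'Cara'], 'sentiment_survey2': [5, 4]})

#Outer join keeps every student who answered at least one survey
pd.merge(survey1, survey2, how = 'outer', indicator = True)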
8.3 Concatenating datasets
The Pandas function concat() is used to stack datasets along an axis. It is similar to NumPy’s concatenate() function.
Example: You are given the life expectancy data of each continent as a separate *.csv file. Visualize the change of life expectancy over time for different continents.
data_asia = pd.read_csv('./Datasets/gdp_lifeExpec_Asia.csv')
data_europe = pd.read_csv('./Datasets/gdp_lifeExpec_Europe.csv')
data_africa = pd.read_csv('./Datasets/gdp_lifeExpec_Africa.csv')
data_oceania = pd.read_csv('./Datasets/gdp_lifeExpec_Oceania.csv')
data_americas = pd.read_csv('./Datasets/gdp_lifeExpec_Americas.csv')
#Appending all the data files, i.e., stacking them on top of each other
data_all_continents = pd.concat([data_asia,data_europe,data_africa,data_oceania,data_americas], keys = ['Asia','Europe','Africa','Oceania','Americas'])
data_all_continents
country | year | lifeExp | pop | gdpPercap | ||
---|---|---|---|---|---|---|
Asia | 0 | Afghanistan | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Afghanistan | 1957 | 30.332 | 9240934 | 820.853030 | |
2 | Afghanistan | 1962 | 31.997 | 10267083 | 853.100710 | |
3 | Afghanistan | 1967 | 34.020 | 11537966 | 836.197138 | |
4 | Afghanistan | 1972 | 36.088 | 13079460 | 739.981106 | |
... | ... | ... | ... | ... | ... | ... |
Americas | 295 | Venezuela | 1987 | 70.190 | 17910182 | 9883.584648 |
296 | Venezuela | 1992 | 71.150 | 20265563 | 10733.926310 | |
297 | Venezuela | 1997 | 72.146 | 22374398 | 10165.495180 | |
298 | Venezuela | 2002 | 72.766 | 24287670 | 8605.047831 | |
299 | Venezuela | 2007 | 73.747 | 26084662 | 11415.805690 |
1704 rows × 5 columns
Let’s have the continent as a column as we need to use that in the visualization.
data_all_continents.reset_index(inplace = True)
data_all_continents.head()
level_0 | level_1 | country | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|---|
0 | Asia | 0 | Afghanistan | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Asia | 1 | Afghanistan | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Asia | 2 | Afghanistan | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Asia | 3 | Afghanistan | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Asia | 4 | Afghanistan | 1972 | 36.088 | 13079460 | 739.981106 |
data_all_continents.drop(columns = 'level_1', inplace = True)
data_all_continents.rename(columns = {'level_0':'continent'}, inplace = True)
data_all_continents.head()
continent | country | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Asia | Afghanistan | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Asia | Afghanistan | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Asia | Afghanistan | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Asia | Afghanistan | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Asia | Afghanistan | 1972 | 36.088 | 13079460 | 739.981106 |
#Change of life expectancy over time for different continents
a = sns.FacetGrid(data_all_continents, col = 'continent', col_wrap = 3, height = 4.5, aspect = 1)
a.map(sns.lineplot, 'year', 'lifeExp')
a.add_legend()
In the above example, datasets were appended (or stacked on top of each other).
Datasets can also be concatenated side-by-side (by providing the argument axis = 1 to the concat() function), similar to what we saw with the merge() function. A sketch with made-up data appears below.
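A minimal sketch; the DataFrames df_left and df_right below are hypothetical:
#Hypothetical datasets sharing the same row indices (values made up for illustration)
df_left = pd.DataFrame({'lifeExp': [28.8, 30.3]}, index = ['c1', 'c2'])
df_right = pd.DataFrame({'gdpPercap': [779.4, 820.9]}, index = ['c1', 'c2'])

#Concatenating side-by-side; rows are aligned on the index
pd.concat([df_left, df_right], axis = 1)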
8.3.1 Practice exercise 4
Read the documentation of the Pandas functions merge() and concat(), and identify the differences. Mention examples of when you can use (i) either, (ii) only concat(), and (iii) only merge().
Solution:
If we need to merge datasets using row indices, we can use either function.
If we need to stack datasets one on top of the other, we can only use concat().
If we need to merge datasets using overlapping columns, we can only use merge().
8.4 Reshaping data
Data often needs to be re-arranged to ease analysis.
8.4.1 Pivoting “long” to “wide” format
pivot()
This function helps re-arrange data from the ‘long’ form to a ‘wide’ form.
Example: Let us consider the dataset data_all_continents
obtained in the previous section after concatenating the data of all the continents.
data_all_continents.head()
continent | country | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
0 | Asia | Afghanistan | 1952 | 28.801 | 8425333 | 779.445314 |
1 | Asia | Afghanistan | 1957 | 30.332 | 9240934 | 820.853030 |
2 | Asia | Afghanistan | 1962 | 31.997 | 10267083 | 853.100710 |
3 | Asia | Afghanistan | 1967 | 34.020 | 11537966 | 836.197138 |
4 | Asia | Afghanistan | 1972 | 36.088 | 13079460 | 739.981106 |
8.4.1.1 Pivoting a single column
For visualizing life expectancy in 2007 against life expectancy in 1957, we will need to filter the data, and then make the plot. Every time we need to compare a metric for one year against another year, we will need to filter the data.
If we often need to compare metrics of one year against another, it will be easier to have each year as a separate column, instead of having all years in a single column.
As we are increasing the number of columns and decreasing the number of rows, we are re-arranging the data from the long form to the wide form.
data_wide = data_all_continents.pivot(index = ['continent','country'], columns = 'year', values = 'lifeExp')
data_wide.head()
year | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
continent | country | ||||||||||||
Africa | Algeria | 43.077 | 45.685 | 48.303 | 51.407 | 54.518 | 58.014 | 61.368 | 65.799 | 67.744 | 69.152 | 70.994 | 72.301 |
Angola | 30.015 | 31.999 | 34.000 | 35.985 | 37.928 | 39.483 | 39.942 | 39.906 | 40.647 | 40.963 | 41.003 | 42.731 | |
Benin | 38.223 | 40.358 | 42.618 | 44.885 | 47.014 | 49.190 | 50.904 | 52.337 | 53.919 | 54.777 | 54.406 | 56.728 | |
Botswana | 47.622 | 49.618 | 51.520 | 53.298 | 56.024 | 59.319 | 61.484 | 63.622 | 62.745 | 52.556 | 46.634 | 50.728 | |
Burkina Faso | 31.975 | 34.906 | 37.814 | 40.697 | 43.591 | 46.137 | 48.122 | 49.557 | 50.260 | 50.324 | 50.650 | 52.295 |
With values of year
as columns, it is easy to compare any metric for different years.
#Visualizing the change in life expectancy of all countries in 2007 as compared to that in 1957, i.e., the overall change in life expectancy in 50 years
sns.scatterplot(data = data_wide, x = 1957, y = 2007, hue = 'continent')
sns.lineplot(data = data_wide, x = 1957, y = 1957)
Observe that for some African countries, the life expectancy has decreased after 50 years. It is worth investigating these countries to identify factors associated with the decrease.
8.4.1.2 Pivoting multiple columns
In the above transformation, we retained only lifeExp
in the ‘wide’ dataset. Suppose we are also interested in visualizing the GDP per capita of countries in one year against another year. In that case, we must have gdpPercap
in the ‘wide’-form data as well.
Let us create a dataset named data_wide_lifeExp_gdpPercap
that will contain both lifeExp
and gdpPercap
for each year in separate columns. We will specify the columns to pivot in the values argument of the pivot()
function.
data_wide_lifeExp_gdpPercap = data_all_continents.pivot(index = ['continent','country'], columns = 'year', values = ['lifeExp','gdpPercap'])
data_wide_lifeExp_gdpPercap.head()
lifeExp | ... | gdpPercap | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | ... | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | |
continent | country | |||||||||||||||||||||
Africa | Algeria | 43.077 | 45.685 | 48.303 | 51.407 | 54.518 | 58.014 | 61.368 | 65.799 | 67.744 | 69.152 | ... | 2550.816880 | 3246.991771 | 4182.663766 | 4910.416756 | 5745.160213 | 5681.358539 | 5023.216647 | 4797.295051 | 5288.040382 | 6223.367465 |
Angola | 30.015 | 31.999 | 34.000 | 35.985 | 37.928 | 39.483 | 39.942 | 39.906 | 40.647 | 40.963 | ... | 4269.276742 | 5522.776375 | 5473.288005 | 3008.647355 | 2756.953672 | 2430.208311 | 2627.845685 | 2277.140884 | 2773.287312 | 4797.231267 | |
Benin | 38.223 | 40.358 | 42.618 | 44.885 | 47.014 | 49.190 | 50.904 | 52.337 | 53.919 | 54.777 | ... | 949.499064 | 1035.831411 | 1085.796879 | 1029.161251 | 1277.897616 | 1225.856010 | 1191.207681 | 1232.975292 | 1372.877931 | 1441.284873 | |
Botswana | 47.622 | 49.618 | 51.520 | 53.298 | 56.024 | 59.319 | 61.484 | 63.622 | 62.745 | 52.556 | ... | 983.653976 | 1214.709294 | 2263.611114 | 3214.857818 | 4551.142150 | 6205.883850 | 7954.111645 | 8647.142313 | 11003.605080 | 12569.851770 | |
Burkina Faso | 31.975 | 34.906 | 37.814 | 40.697 | 43.591 | 46.137 | 48.122 | 49.557 | 50.260 | 50.324 | ... | 722.512021 | 794.826560 | 854.735976 | 743.387037 | 807.198586 | 912.063142 | 931.752773 | 946.294962 | 1037.645221 | 1217.032994 |
5 rows × 24 columns
The metric for each year is now in a separate column, and can be visualized directly. Note that re-arranging the dataset from the ‘long’ form to the ‘wide’ form leads to hierarchical indexing of columns when multiple ‘values’ need to be re-arranged. In this case, the multiple ‘values’ that need to be re-arranged are lifeExp
and gdpPercap
.
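For example, the GDP per capita in 2007 can be visualized against that in 1957 by selecting the gdpPercap sub-DataFrame with the outer column label. A sketch that mirrors the life expectancy plot of Section 8.4.1.1:
#Selecting the 'gdpPercap' columns with the outer column label
gdp_wide = data_wide_lifeExp_gdpPercap['gdpPercap']

#Visualizing GDP per capita in 2007 against that in 1957, with a y = x reference line
sns.scatterplot(data = gdp_wide, x = 1957, y = 2007, hue = 'continent')
sns.lineplot(data = gdp_wide, x = 1957, y = 1957)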
8.4.2 Melting “wide” to “long” format
melt()
This function is used to re-arrange the dataset from the ‘wide’ form to the ‘long’ form.
8.4.2.1 Melting columns with a single type of value
Let us consider data_wide
created in the previous section.
data_wide.head()
year | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
continent | country | ||||||||||||
Africa | Algeria | 43.077 | 45.685 | 48.303 | 51.407 | 54.518 | 58.014 | 61.368 | 65.799 | 67.744 | 69.152 | 70.994 | 72.301 |
Angola | 30.015 | 31.999 | 34.000 | 35.985 | 37.928 | 39.483 | 39.942 | 39.906 | 40.647 | 40.963 | 41.003 | 42.731 | |
Benin | 38.223 | 40.358 | 42.618 | 44.885 | 47.014 | 49.190 | 50.904 | 52.337 | 53.919 | 54.777 | 54.406 | 56.728 | |
Botswana | 47.622 | 49.618 | 51.520 | 53.298 | 56.024 | 59.319 | 61.484 | 63.622 | 62.745 | 52.556 | 46.634 | 50.728 | |
Burkina Faso | 31.975 | 34.906 | 37.814 | 40.697 | 43.591 | 46.137 | 48.122 | 49.557 | 50.260 | 50.324 | 50.650 | 52.295 |
Suppose we wish to visualize the change of life expectancy over time for different continents, as we did in Section 8.3. For plotting lifeExp
against year
, all the years must be in a single column. Thus, we need to melt the columns of data_wide
to a single column and call it year
.
But before melting the columns in the above dataset, we will convert continent
to a column, as we need to make subplots based on continent.
The Pandas DataFrame method reset_index()
can be used to remove one or more levels of indexing from the DataFrame.
#Making 'continent' a column instead of a row index at level 0
data_wide.reset_index(inplace = True, level = 0)
data_wide.head()
year | continent | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | |||||||||||||
Algeria | Africa | 43.077 | 45.685 | 48.303 | 51.407 | 54.518 | 58.014 | 61.368 | 65.799 | 67.744 | 69.152 | 70.994 | 72.301 |
Angola | Africa | 30.015 | 31.999 | 34.000 | 35.985 | 37.928 | 39.483 | 39.942 | 39.906 | 40.647 | 40.963 | 41.003 | 42.731 |
Benin | Africa | 38.223 | 40.358 | 42.618 | 44.885 | 47.014 | 49.190 | 50.904 | 52.337 | 53.919 | 54.777 | 54.406 | 56.728 |
Botswana | Africa | 47.622 | 49.618 | 51.520 | 53.298 | 56.024 | 59.319 | 61.484 | 63.622 | 62.745 | 52.556 | 46.634 | 50.728 |
Burkina Faso | Africa | 31.975 | 34.906 | 37.814 | 40.697 | 43.591 | 46.137 | 48.122 | 49.557 | 50.260 | 50.324 | 50.650 | 52.295 |
data_melted = pd.melt(data_wide, id_vars = ['continent'], var_name = 'Year', value_name = 'LifeExp')
data_melted.head()
continent | Year | LifeExp | |
---|---|---|---|
0 | Africa | 1952 | 43.077 |
1 | Africa | 1952 | 30.015 |
2 | Africa | 1952 | 38.223 |
3 | Africa | 1952 | 47.622 |
4 | Africa | 1952 | 31.975 |
With the above DataFrame, we can visualize the mean life expectancy against year separately for each continent, as sketched below.
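A sketch of that visualization, reusing the FacetGrid pattern from Section 8.3; note that sns.lineplot aggregates the multiple observations per year with their mean by default:
#Mean life expectancy over time, one panel per continent
a = sns.FacetGrid(data_melted, col = 'continent', col_wrap = 3, height = 4.5)
a.map(sns.lineplot, 'Year', 'LifeExp')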
If we wish to have country also in the above data, we can keep it while resetting the index:
#Creating 'data_wide' again
data_wide = data_all_continents.pivot(index = ['continent','country'], columns = 'year', values = 'lifeExp')

#Resetting the row indices to default values
data_wide.reset_index(inplace = True)
data_wide.head()
year | continent | country | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Africa | Algeria | 43.077 | 45.685 | 48.303 | 51.407 | 54.518 | 58.014 | 61.368 | 65.799 | 67.744 | 69.152 | 70.994 | 72.301 |
1 | Africa | Angola | 30.015 | 31.999 | 34.000 | 35.985 | 37.928 | 39.483 | 39.942 | 39.906 | 40.647 | 40.963 | 41.003 | 42.731 |
2 | Africa | Benin | 38.223 | 40.358 | 42.618 | 44.885 | 47.014 | 49.190 | 50.904 | 52.337 | 53.919 | 54.777 | 54.406 | 56.728 |
3 | Africa | Botswana | 47.622 | 49.618 | 51.520 | 53.298 | 56.024 | 59.319 | 61.484 | 63.622 | 62.745 | 52.556 | 46.634 | 50.728 |
4 | Africa | Burkina Faso | 31.975 | 34.906 | 37.814 | 40.697 | 43.591 | 46.137 | 48.122 | 49.557 | 50.260 | 50.324 | 50.650 | 52.295 |
#Melting the year columns into a single column
data_melted = pd.melt(data_wide, id_vars = ['continent','country'], var_name = 'Year', value_name = 'LifeExp')
data_melted.head()
continent | country | Year | LifeExp | |
---|---|---|---|---|
0 | Africa | Algeria | 1952 | 43.077 |
1 | Africa | Angola | 1952 | 30.015 |
2 | Africa | Benin | 1952 | 38.223 |
3 | Africa | Botswana | 1952 | 47.622 |
4 | Africa | Burkina Faso | 1952 | 31.975 |
8.4.2.2 Melting columns with multiple types of values
Consider the dataset created in Section 8.4.1.2. It has two types of values - lifeExp
and gdpPercap
- which are the column labels at the outer level. The melt()
function will melt all the years of data into a single column. However, it will create another column based on the outer-level column labels - lifeExp
and gdpPercap
- to distinguish between these two types of values. Here, we see that the function melt()
internally uses hierarchical indexing to handle the transformation of multiple types of columns.
data_melt = pd.melt(data_wide_lifeExp_gdpPercap.reset_index(), id_vars = ['continent','country'], var_name = ['Metric','year'])
data_melt.head()
continent | country | Metric | year | value | |
---|---|---|---|---|---|
0 | Africa | Algeria | lifeExp | 1952 | 43.077 |
1 | Africa | Angola | lifeExp | 1952 | 30.015 |
2 | Africa | Benin | lifeExp | 1952 | 38.223 |
3 | Africa | Botswana | lifeExp | 1952 | 47.622 |
4 | Africa | Burkina Faso | lifeExp | 1952 | 31.975 |
Although the data above is in the ‘long’ form, it is not quite in its original format, as in data_all_continents
. We need to pivot again by Metric
to have two separate columns of gdpPercap
and lifeExp
.
data_restore = data_melt.pivot(index = ['continent','country','year'], columns = 'Metric')
data_restore.head()
value | ||||
---|---|---|---|---|
Metric | gdpPercap | lifeExp | ||
continent | country | year | ||
Africa | Algeria | 1952 | 2449.008185 | 43.077 |
1957 | 3013.976023 | 45.685 | ||
1962 | 2550.816880 | 48.303 | ||
1967 | 3246.991771 | 51.407 | ||
1972 | 4182.663766 | 54.518 |
Now, we can convert the row indices of continent
and country
to columns to restore the dataset to the same form as data_all_continents
.
data_restore.reset_index(inplace = True)
data_restore.head()
continent | country | year | value | ||
---|---|---|---|---|---|
Metric | gdpPercap | lifeExp | |||
0 | Africa | Algeria | 1952 | 2449.008185 | 43.077 |
1 | Africa | Algeria | 1957 | 3013.976023 | 45.685 |
2 | Africa | Algeria | 1962 | 2550.816880 | 48.303 |
3 | Africa | Algeria | 1967 | 3246.991771 | 51.407 |
4 | Africa | Algeria | 1972 | 4182.663766 | 54.518 |
8.4.3 Practice exercise 5
8.4.3.1
Both unstack()
and pivot()
seem to transform the data from the ‘long’ form to the ‘wide’ form. Is there a difference between the two functions?
Solution:
Yes, both the functions transform the data from the ‘long’ form to the ‘wide’ form. However, unstack()
pivots the row indices, while pivot()
pivots the columns of the DataFrame.
Even though both functions are a bit different, it is possible to just use one of them to perform a reshaping operation. If we wish to pivot a column, we can either use pivot()
directly on the column, or we can convert the column to row indices and then use unstack()
. If we wish to pivot row indices, we can either use unstack()
directly on the row indices, or we can convert row indices to a column and then use pivot()
.
To summarise, using one function may be more straightforward than using the other one, but either can be used for reshaping data from the ‘long’ form to the ‘wide’ form.
Below is an example where we perform the same reshaping operation with either function.
Consider the data data_all_continent
. Suppose we wish to transform it to data_wide
as we did using pivot()
in Section 8.4.1.1. Let us do it using unstack()
, instead of pivot()
.
The first step will be to reindex the data to set year
as a row index, along with continent
and country
, because these two columns were set as indices with the pivot()
function in Section 8.4.1.1.
#Reindexing data to make 'continent', 'country', and 'year' hierarchical row indices
data_reindexed = data_all_continents.set_index(['continent','country','year'])
data_reindexed
lifeExp | pop | gdpPercap | |||
---|---|---|---|---|---|
continent | country | year | |||
Asia | Afghanistan | 1952 | 28.801 | 8425333 | 779.445314 |
1957 | 30.332 | 9240934 | 820.853030 | ||
1962 | 31.997 | 10267083 | 853.100710 | ||
1967 | 34.020 | 11537966 | 836.197138 | ||
1972 | 36.088 | 13079460 | 739.981106 | ||
... | ... | ... | ... | ... | ... |
Americas | Venezuela | 1987 | 70.190 | 17910182 | 9883.584648 |
1992 | 71.150 | 20265563 | 10733.926310 | ||
1997 | 72.146 | 22374398 | 10165.495180 | ||
2002 | 72.766 | 24287670 | 8605.047831 | ||
2007 | 73.747 | 26084662 | 11415.805690 |
1704 rows × 3 columns
Now we can use unstack()
to pivot the desired row index, i.e., year
. Also, since we are only interested in pivoting the values of lifeExp
(as in the example in Section 8.4.1.1), we will filter the pivoted data with the lifeExp
column label.
data_wide_with_unstack = data_reindexed.unstack('year')['lifeExp']
data_wide_with_unstack
year | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
continent | country | ||||||||||||
Africa | Algeria | 43.077 | 45.685 | 48.303 | 51.407 | 54.518 | 58.014 | 61.368 | 65.799 | 67.744 | 69.152 | 70.994 | 72.301 |
Angola | 30.015 | 31.999 | 34.000 | 35.985 | 37.928 | 39.483 | 39.942 | 39.906 | 40.647 | 40.963 | 41.003 | 42.731 | |
Benin | 38.223 | 40.358 | 42.618 | 44.885 | 47.014 | 49.190 | 50.904 | 52.337 | 53.919 | 54.777 | 54.406 | 56.728 | |
Botswana | 47.622 | 49.618 | 51.520 | 53.298 | 56.024 | 59.319 | 61.484 | 63.622 | 62.745 | 52.556 | 46.634 | 50.728 | |
Burkina Faso | 31.975 | 34.906 | 37.814 | 40.697 | 43.591 | 46.137 | 48.122 | 49.557 | 50.260 | 50.324 | 50.650 | 52.295 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Europe | Switzerland | 69.620 | 70.560 | 71.320 | 72.770 | 73.780 | 75.390 | 76.210 | 77.410 | 78.030 | 79.370 | 80.620 | 81.701 |
Turkey | 43.585 | 48.079 | 52.098 | 54.336 | 57.005 | 59.507 | 61.036 | 63.108 | 66.146 | 68.835 | 70.845 | 71.777 | |
United Kingdom | 69.180 | 70.420 | 70.760 | 71.360 | 72.010 | 72.760 | 74.040 | 75.007 | 76.420 | 77.218 | 78.471 | 79.425 | |
Oceania | Australia | 69.120 | 70.330 | 70.930 | 71.100 | 71.930 | 73.490 | 74.740 | 76.320 | 77.560 | 78.830 | 80.370 | 81.235 |
New Zealand | 69.390 | 70.260 | 71.240 | 71.520 | 71.890 | 72.220 | 73.840 | 74.320 | 76.330 | 77.550 | 79.110 | 80.204 |
142 rows × 12 columns
The above dataset is the same as that obtained using the pivot()
function in Section 8.4.1.1.
8.4.3.2
Both stack()
and melt()
seem to transform the data from the ‘wide’ form to the ‘long’ form. Is there a difference between the two functions?
Solution:
Following the trend of the previous question, we can always use stack()
instead of melt()
and vice-versa. The main difference is that melt()
lets us choose the identifier columns with the argument id_vars
. However, if we use stack()
, we will need to reset the relevant melted row indices as columns. On the other hand, if we wished to have the melted columns as row indices, we could either use stack()
directly, or use melt()
and then set the desired columns as row indices.
To summarise, using one function may be more straightforward than using the other one, but either can be used for reshaping data from the ‘wide’ form to the ‘long’ form.
Let us melt the data data_wide_with_unstack
using the stack()
function to obtain the same dataset as obtained with the melt()
function in Section 8.4.2.1.
#Stacking the data
data_stacked = data_wide_with_unstack.stack()
data_stacked
continent country year
Africa Algeria 1952 43.077
1957 45.685
1962 48.303
1967 51.407
1972 54.518
...
Oceania New Zealand 1987 74.320
1992 76.330
1997 77.550
2002 79.110
2007 80.204
Length: 1704, dtype: float64
Now we need to convert the row indices continent
and country
to columns, as in the melted data in Section 8.4.2.1.
#Putting 'continent' and 'country' as columns
data_long_with_stack = data_stacked.reset_index()
data_long_with_stack
continent | country | year | 0 | |
---|---|---|---|---|
0 | Africa | Algeria | 1952 | 43.077 |
1 | Africa | Algeria | 1957 | 45.685 |
2 | Africa | Algeria | 1962 | 48.303 |
3 | Africa | Algeria | 1967 | 51.407 |
4 | Africa | Algeria | 1972 | 54.518 |
... | ... | ... | ... | ... |
1699 | Oceania | New Zealand | 1987 | 74.320 |
1700 | Oceania | New Zealand | 1992 | 76.330 |
1701 | Oceania | New Zealand | 1997 | 77.550 |
1702 | Oceania | New Zealand | 2002 | 79.110 |
1703 | Oceania | New Zealand | 2007 | 80.204 |
1704 rows × 4 columns
Finally, we need to rename the column named 0 to LifeExp
to obtain the same dataset as in Section 8.4.2.1.
#Renaming column 0 to 'LifeExp'
data_long_with_stack.rename(columns = {0:'LifeExp'}, inplace = True)
data_long_with_stack
continent | country | year | LifeExp | |
---|---|---|---|---|
0 | Africa | Algeria | 1952 | 43.077 |
1 | Africa | Algeria | 1957 | 45.685 |
2 | Africa | Algeria | 1962 | 48.303 |
3 | Africa | Algeria | 1967 | 51.407 |
4 | Africa | Algeria | 1972 | 54.518 |
... | ... | ... | ... | ... |
1699 | Oceania | New Zealand | 1987 | 74.320 |
1700 | Oceania | New Zealand | 1992 | 76.330 |
1701 | Oceania | New Zealand | 1997 | 77.550 |
1702 | Oceania | New Zealand | 2002 | 79.110 |
1703 | Oceania | New Zealand | 2007 | 80.204 |
1704 rows × 4 columns